knitr::include_graphics("C:/Users/hp/Downloads/data viz/thd-logo.jpg")
STUDENT: Shefali Badre ; MATRICULATION NUMBER: 22203735 ; EMAIL ID: shefali.badre@stud.th-deg.de
STUDENT: Shraddha Karambelkar ; MATRICULATION NUMBER: 22204857 ; EMAIL ID: shraddha.karambelkar@stud.th-deg.de
ABOUT PROJECT TYCHO
For this project we are using Project Tycho Level 2 Data Version 1.1.0 which contains 50 diseases from all 50 states in the USA (which is approximately 1284 US cities) from the year 1888 to 2014.[0] Project Tycho is global health data platform which collects public health data from different health agencies and researchers and use the data to improve health informatics. Its mainly focus is on infectious diseases and epidemics.
Measles is a highly contagious, serious disease caused by a virus of the paramyxovirus family[1]. The disease has an incubation period of 8 - 12 days, and then presents with increasing fever (up to 39 - 40.5 degrees celsius), cough, coryza and conjunctivitis. The symptoms intensify over 2 - 4 days before the onset of rash and peak on the day one of rash. The rash usually appears first on the face and neck as discrete erythematous patches. The number increases for 2 - 3 days, mostly on the trunk and face. Discrete lesions are also usually found on the distal extremities. The rash lasts for 3- 7 days before disappearing[2].
Since the introduction of measles vaccine in 1963, CDC (Center for Disease Control) of USA began its effort to fight this illness. By the year 2000 the efforts paid off and US declared measles eliminated.[0]
Even though a safe and cost effective vaccine is available, in 2021, there were an estimated 128,000 deaths globally, mostly among non-vaccinated children under the age of 5. In that same year, only about 81% of children received one dose of the measles vaccine by their first birthday, the lowest since 2008[3]. In the year 2022, in Columbus, Ohio, there was a Measles outbreak fueled by increasing vaccine hesitancy owing to increased anti-vaccine sentiment sparked by conspiracy theories surrounding the COVID-19 pandemic[4].
Our analysis is aimed at presenting cases before and after the introduction of the Measles Containing Vaccine (MCV) in the hopes that it will dispel the reservations of parents who might be hesitant about allowing their children to receive the vaccine.
Given that 2021 was the year with the lowest immunization rate for Measles in children globally since 2008, and that 2022 saw an outbreak of measles in the United States, owing to increased vaccine-skepticism, we present the incidence of Measles cases over the years in an attempt to let the data speak for itself to vaccine skeptics on the efficacy and necessity of measles vaccination.
We made use of the following packages for comprehensive analysis:
4.1. ANALYSIS
4.4.1 Data Loading and Data Clean-up
We load our data initially with an absolute path
dataTycho<-read.csv("C:/Users/hp/Downloads/data viz/ProjectTycho_Level2_v1.1.0.csv")
We get a look at the data inside our project Tycho dataframe
dataTycho
To clean up the data, we install the tidyverse package. We’ve commented the code out because in R for Data Science, we should avoid leaving in code that might change the settings on someone else’s computer if we mean to send our project to someone else.
#install.packages("tidyverse")
Here we load the tidyverse library
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We filter the Measles data we need from the Tycho dataset using the filter() function of the tidy verse library and assign it to the dataframe “usMeasles”
usMeasles <- filter(dataTycho, disease == "MEASLES")
We take a peek at our usMeasles dataframe
usMeasles
We can see that the from_data and to_date variables are type character but we need them to be type date. To do this, we load the lubridate library
library(lubridate)
We use the ymd() function from lubridate in conjunction with the mutate() function from tidyverse to achieve our conversion of the from_date and to_date variables to data type date.
usMeasles <- mutate(usMeasles, from_date = ymd(from_date),
to_date = ymd(to_date))
We look into our usMeasles data frame again to check if from_date and to_date have been successfully converted. Their data type is now date.
usMeasles
We arrange the data by date using the arrange() function from tidyverse
usMeaslesArranged <- arrange(usMeasles, from_date)
We take a look at our data and ascertain that our data is now correctly arranged
usMeaslesArranged
We want to what US measles incidence per state is, so we need to group our data by state and get the total number of cases in each state. To do this, we use the group_by() and summarize() functions provided by tidyverse library and save our new data into the usMeaslesPerState dataframe
usMeaslesPerState <- usMeaslesArranged %>%
group_by(state) %>%
summarize(case_per_state = sum(number, na.rm = TRUE))
We checkout the usMeaslesPerState dataframe
usMeaslesPerState
To visualize our data, we require the usmap package
#install.packages("usmap")
We load our the usmap package
library(usmap)
## Warning: package 'usmap' was built under R version 4.3.2
The usmap package requires to variables to work. The first is fips, a representation of geographical locations in the United States with unique codes. The second variable is values, which in this case is the number of cases per state store in the case_per_state variable. We use the fips() function to convert our state into fips codes, then convert the fips codes into numeric data type and save it in a variable fips. We convert our case_per_state into numeric and save it in values.
usMeaslesPerState <- usMeaslesPerState %>%
mutate(fips = as.numeric(fips(state)),
values = as.numeric(case_per_state))
We check our update usMeaslesPerState data frame and can now see the additional variables fips and values
usMeaslesPerState
We want to remove any NA cells in the fips variable of our dataframe. We use the filter() function with the !is.na() function.
usMeaslesPerState <- usMeaslesPerState %>%
filter(!is.na(fips))
We check again to see that the NA values have been removed.
usMeaslesPerState
We can now call the plot_usmap() function on our usMeaslesPerState dataframe. What jumps out is that the states of New York, Pennsylvania, California and Texas have the highest incidence rates
plot_usmap(data = usMeaslesPerState) +
scale_fill_continuous(low = "#ebf059", high = "#f05959", name = "case_per_state", label = scales::comma) +
labs(title = "Number of Cases Per State") +
theme(legend.position = "right")
We can visualize this data better by using a bar chart. For that we only need our state and case_per_state variables
usMeaslesPerState <- usMeaslesPerState %>%
select(state, case_per_state)
We check to see if our dataframe contains only the state and case_per_state variables
usMeaslesPerState
4.1.2 Data Analysis of our DataSet
We can now plot a bar chart on our dataframe. As stated earlier, we can see that New York, Pennsylvania and California have the highest peaks. The common factor amongst all these states is that they have very high populations. so the high incidence of measles cases there is not surprising giving their high populations
ggplot(data = usMeaslesPerState, mapping = aes(x = state, y = case_per_state, fill = state)) +
geom_bar(stat = "identity") +
labs(title = "Total Number of Cases Per State")
Now we proceed to our mission of showcasing the incidence of measles over time. We group our usMeaslesArranged dataframe by year and summarize by total number of cases in each year and save it into a new dataframe
usMeaslesPerYear <- usMeaslesArranged %>%
group_by(year = year(from_date)) %>%
summarize(case_per_year = sum(number))
We checkout the usMeaslesPerYear dataframe
usMeaslesPerYear
We also checkout where our usMeaslesPerYear dataframe ends. The last entry is 2001
tail(usMeaslesPerYear, 10)
We use a bar plot to visualize our data. We can see that the years 1928 to 1966 had very high incidence rates, and then 1966 and onwards saw a sharp decline in the number of measles cases. It should be noted that the measles vaccine was first introduced in 1963 by John Enders but then a better version was introduced in 1968 by Maurice Hillman[5]. In our barplot, we observe a steep fall in the number of measles cases after the introduction of the Maurice Vaccine.
ggplot(usMeaslesPerYear, mapping = aes(x = year, y = case_per_year, fill = year)) +
geom_bar(stat = "identity") +
labs(title = "Measles Cases Per Year")
We take a closer look at the 1980s to 2000s which are hard to see on the barplot
usMeasles80s <- usMeaslesPerYear %>%
filter(year >= 1981 & year <= 2001)
We use filter() function to get data from the year 1981 to 2001, save it in the usMeasles80s dataframe and take a look at it
usMeasles80s
We use ggplot to visualize the data and save it into the usMeasles80sPlot dataframe
usMeasles80sPlot <- ggplot(usMeasles80s, aes(x = year, y = case_per_year, fill = year)) +
geom_area() +
geom_line() +
labs(title = "Measles Cases Per Year 1981-2000")
We checkout the usMeasles80sPlot and can see that there was a slight peak in cases in the late 80s
usMeasles80sPlot
4.1.3 Data Loading and Clean-Up of WHO vaccination data
We got some data on measles vaccinations from the WHO website in excel format. To open it, we install the readxl package
#install.packages("readxl")
We load our readxl package
library(readxl)
We load our excel data into the measlesVacc dataframe initally using the absolute path.
measlesVacc <- read_excel("C:/Users/hp/Downloads/data viz/measles_vaccination1.xlsx")
## New names:
## • `` -> `...68`
We convert our excel data into csv data
write.csv(measlesVacc, "measles_vaccination1.csv")
Note That: We manually cleaned up the measles_vaccination1.csv file with our text editor, removed unwanted meta data and renamed the file measles_vaccination3.csv
measlesVacc2 <- read.csv("measles_vaccination3.csv")
We read the csv data and save it into the measlesVacc2 data frame
measlesVacc2
Our data contains vaccination data from many countries but we are only concerned with the United States so we filter out the data we need and save it in usVaccData
library(dplyr)
usVaccData <- filter(measlesVacc2, CountryCode == "USA")
We check our usVaccData
usVaccData
We check the class of the variables in usVaccData
sapply(usVaccData, class)
## CountryName CountryCode Indicator.Name Indicator.Code X1960
## "character" "character" "character" "character" "logical"
## X1961 X1962 X1963 X1964 X1965
## "logical" "logical" "logical" "logical" "logical"
## X1966 X1967 X1968 X1969 X1970
## "logical" "logical" "logical" "logical" "logical"
## X1971 X1972 X1973 X1974 X1975
## "logical" "logical" "logical" "logical" "logical"
## X1976 X1977 X1978 X1979 X1980
## "logical" "logical" "logical" "logical" "numeric"
## X1981 X1982 X1983 X1984 X1985
## "numeric" "numeric" "numeric" "numeric" "numeric"
## X1986 X1987 X1988 X1989 X1990
## "numeric" "numeric" "numeric" "numeric" "numeric"
## X1991 X1992 X1993 X1994 X1995
## "numeric" "numeric" "numeric" "numeric" "numeric"
## X1996 X1997 X1998 X1999 X2000
## "numeric" "numeric" "numeric" "numeric" "numeric"
## X2001 X2002 X2003 X2004 X2005
## "numeric" "numeric" "numeric" "numeric" "numeric"
## X2006 X2007 X2008 X2009 X2010
## "numeric" "numeric" "numeric" "numeric" "numeric"
## X2011 X2012 X2013 X2014 X2015
## "numeric" "numeric" "numeric" "numeric" "numeric"
## X2016 X2017 X2018 X2019 X2020
## "numeric" "numeric" "numeric" "numeric" "numeric"
## X2021 X2022
## "numeric" "logical"
The early years in usVaccData have a lot of NA values so we select only data from 1981 onwards
usVaccDataFinal <- select(usVaccData, -(X1960:X1980))
We checkout usVaccDataFinal
usVaccDataFinal
Now we have immunization data from 1981 to 2001. However, the data is not entered a usable format for us. To fix this, we use tribble() function to create a new dataframe with the same information in usVaccDataFinal but in a format that is more amenable to use.
usMeaslesVaccRate <- tribble(~years, ~immunization_rate,
1981, 97,
1982, 97,
1983, 98,
1984, 98,
1985, 97,
1986, 97,
1987, 82,
1988, 98,
1989, 94,
1990, 90,
1991, 87,
1992, 83,
1993, 84,
1994, 89,
1995, 88,
1996, 91,
1997, 91,
1998, 92,
1999, 92,
2000, 91,
2001, 91)
We now have a new dataframe called usMeaslesVaccRate with two variables, years and immunization_rate
usMeaslesVaccRate
We use a ggplot to visualize the data and save it in usMeaslesVaccRatePlot
usMeaslesVaccRatePlot <- ggplot(usMeaslesVaccRate, aes(x = years, y = immunization_rate, fill = years)) +
geom_area() +
geom_line() +
labs(title = "Measles Immunization Rate from Years 1981-2001")
4.1.4 Data Analysis of our Measles Vaccination Dataset
We chekout what the usMeaslesVaccRatePlot looks like. We can see that there is a small decline in the vaccination rate in the late 80s
usMeaslesVaccRatePlot
4.1.5 Comparison of Measles Incidence Plot and Measles Vaccination Rate Plot
We load the patchwork libray so we can get the usMeasles80sPlot and usMeaslesVaccRatePlot side by side
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.2
We compare both plots and can see that the sligh peak in masles cases in the late 80s overlaps with the slight decrease in vaccination rates in the late 80s
usMeasles80sPlotAndusMeaslesVaccRatePlot <- usMeasles80sPlot + usMeaslesVaccRatePlot
usMeasles80sPlotAndusMeaslesVaccRatePlot + plot_layout(heights = c(3,1))
4.1.6. Statistical Analysis of Measles Incidence Rate with Vaccination Rate
First, we section some data that we need. We get the incidence of measles cases from 1931 to 2001. However, since we know our vaccination data lacks entries for the early periods of vaccination, ie, from 1963 to 1979, we have to remove those years from our incidence data too.
usMeasles30s <- usMeaslesPerYear %>%
filter(year >= 1931 & year <= 2001 & !(year >= 1963 & year <= 1979))
We know that before the introduction of the measles vaccine to the public in 1963, and again in 1968, there were no children vaccinated against measles. That means the vaccination rate against measles before 1963 is 0%.
usMeaslesVaccRate30s <- tribble(~years, ~immunization_rate,
1931, 0,
1932, 0,
1933, 0,
1934, 0,
1935, 0,
1936, 0,
1937, 0,
1938, 0,
1939, 0,
1940, 0,
1941, 0,
1942, 0,
1943, 0,
1944, 0,
1945, 0,
1946, 0,
1947, 0,
1948, 0,
1949, 0,
1950, 0,
1951, 0,
1952, 0,
1953, 0,
1954, 0,
1955, 0,
1956, 0,
1957, 0,
1958, 0,
1959, 0,
1960, 0,
1961, 0,
1962, 0,
1963, NA,
1964, NA,
1965, NA,
1966, NA,
1967, NA,
1968, NA,
1969, NA,
1970, NA,
1971, NA,
1972, NA,
1973, NA,
1974, NA,
1975, NA,
1976, NA,
1977, NA,
1978, NA,
1979, NA,
1980, 97,
1981, 97,
1982, 97,
1983, 98,
1984, 98,
1985, 97,
1986, 97,
1987, 82,
1988, 98,
1989, 94,
1990, 90,
1991, 87,
1992, 83,
1993, 84,
1994, 89,
1995, 88,
1996, 91,
1997, 91,
1998, 92,
1999, 92,
2000, 91,
2001, 91)
We’ve assinged 0% as the immunization rate for years before 1963, and NA for the years 1963 to 1979 because even though vaccinations were available in those years, we do not have the data. But to proceed, we must remove the NA values from our data.
usMeaslesVaccRate30s <- usMeaslesVaccRate30s %>%
filter(!is.na(immunization_rate))
We check out usMeasles30s
usMeasles30s
We check out usMeaslesVaccRate30s
usMeaslesVaccRate30s
We do a correlation test using Pearson’s product moment test. The value of our correlation coefficient is -0.8 which indiates that there is a negative correlation between the incidence of measles cases and immunization rate. Our p value also indicates that the correlation is statistically significant.
cor.test(usMeasles30s$case_per_year, usMeaslesVaccRate30s$immunization_rate)
##
## Pearson's product-moment correlation
##
## data: usMeasles30s$case_per_year and usMeaslesVaccRate30s$immunization_rate
## t = -12.273, df = 52, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9180150 -0.7728455
## sample estimates:
## cor
## -0.8621872
4.1.7 Modeling
We want to model our y dependent variable (case_per_year) on our independent variable x variable (immunization_rate). for this, we create a new dataframe called modelData
modelData <- usMeasles30s %>%
select(y = case_per_year) %>%
mutate(x = usMeaslesVaccRate30s$immunization_rate)
We fit a linear regression on our modelData
modelDataMod <- lm(y ~ x, data = modelData)
We find the intercept and coefficient
coef(modelDataMod)
## (Intercept) x
## 585953.839 -6326.999
To make predictions, we need to load the modelr pacakge
library(modelr)
## Warning: package 'modelr' was built under R version 4.3.1
We use the data_grid() function to generate an evenly spaced grid of values that covers the region where out data lies
grid <- modelData %>%
data_grid(x)
grid
Next we use the add_predictions function to add predictions
grid <- grid %>%
add_predictions(modelDataMod)
grid
We plot our predictions using ggplot2
ggplot(modelData, aes(x)) +
geom_point(aes(y = y)) +
geom_line(aes(y = pred), data = grid, color = "red", size = 1)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We add residuals to tell us what our model has missed
modelData <- modelData %>%
add_residuals(modelDataMod)
modelData
We visualize our residuals with a frequency polygon
ggplot(modelData, aes(resid)) +
geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
RESULTS
Our data shows that the introduction of the Maurice Hillman vaccine in 1968 caused a sharp decline in the incidence of measles in the United States. Our data also shows that a slight decrease in the immunization rate in the late 80s coincided with a slight increase in measles cases over the same time period.
DISCUSSION
We had to get a bit creative with our statistical analysis in 4.1.6.However, assigning 0% as the vaccination rates for the years before the licensing of any measles vaccine makes sense. And when we do that, we can see that there is a strong negative correlation between measles incidence and vaccination rates.
CONCLUSION
Our aim is to present the data to vaccination skeptics and show them how the discovery and introduction of the measles vaccine caused a sharp decrease in the incidence of measles in the late 60s. Millions of lives have been saved by the measles vaccine since its discovery. If measles vaccination rates continue to fall in the aftermath of the covid-19 pandemic, it will have real consequences for public health.
CITATIONS
[0]Centers for Disease Control and Prevention Measles cases in 2022, from https://www.cdc.gov/measles/cases-outbreaks.html
[1] World Health Organization. (n.d.). Measles. Retrieved June 18, 2023, from https://www.who.int/news-room/fact-sheets/detail/measles
[2] Robert T. Perry , Neal A. Halsey, The Clinical Significance of Measles: A Review, The Journal of Infectious Diseases, Volume 189, Issue Supplement_1, May 2004, Pages S4–S16, https://doi.org/10.1086/377712
[3]World Health Organization. (n.d.). Measles. Retrieved June 18, 2023, from https://www.who.int/news-room/fact-sheets/detail/measles
[4] Lena H Sun. The Washington Post. (2022, December 26). Vaccine hesitancy: Measles, chickenpox, polio, and flu. Retrieved June 18, 2023, from https://www.washingtonpost.com/health/2022/12/26/vaccine-hesitancy-measles-chickenpox-polio-flu
[5]“Measles vaccine.” Wikipedia, The Free Encyclopedia. Wikimedia Foundation, Inc. Retrieved June 18, 2023, from https://en.wikipedia.org/wiki/Measles_vaccine